In Germany we have the following saying: Everything that you need to know about power has already been said – but not yet by everyone. Hence, in what follows, I take my turn and try to explain the statistical concept of power using my own words.
Note that this post is written primarily for students, to provide some guidelines for how to run power analyses when writing theses or reports. I’ll also provide the R code necessary to execute all the analyses, so everyone new to the topic or to R might benefit as well. Although I display the code in the blog itself, you can also download everything from my GitHub.
Power is an extremely relevant statistical concept. Really, the importance of having a well-powered study cannot be overestimated. I would say that when assessing a study’s quality, adequate power is one of the most crucial indicators.
If you think that this is an exaggeration or if you feel like power is more of a nuisance, one of those pesky concepts that nerdy statisticians want you to implement but that mostly distracts from doing real research, I would reply the following:
If you really think that, you probably haven’t yet fully understood power.
At least that was the way I felt. Once it really clicked, it became an absolute no-brainer that running a priori power analyses is one of the most useful things you can do when doing research.
In short, the statistical power of a study describes the probability of you being able to cry “Eureka! I found a significant effect!”, given that the effect actually exists. Often, though not always, that’s exactly what you want, which is why ideally you would like to increase that probability (but see my thoughts on the SESOI below).
More specifically, the power of a study can be calculated once you know the size of the true effect, the sample size, and the alpha level.
But calculations are one thing. What’s even more illustrative are simulations. I think I only really started to understand power once I saw some tutorials with simulations of actual data (for example, Lakens’s excellent MOOC Improving Your Statistical Inferences comes to mind).
The main advantage of simulating your own data is that you can actually specify the true effect in the population. So, by definition, you know what result your study should reveal. To illustrate the importance of power, let us create some data for a typical research question from my own field, media psychology.
Personally, I’m very much interested in the so-called privacy paradox, which states that the privacy concerns of people are unrelated to their actual information sharing behavior (e.g., Barnes 2006). It’s mostly safe to say that by now the privacy paradox has been refuted. For example, a recent meta-analysis found that privacy concerns and information sharing exhibit a relation of r = -.13 (Baruh, Secinti, and Cemalcilar 2017). Hence, if people are more concerned, they are (slightly) less willing to share information.
So let’s imagine that we want to find out whether the privacy paradox exists among the students in Hohenheim. Let’s start simulating some data!
First off, we’re going to load some packages and set a seed (necessary so that you get the same results if you rerun the analyses).
# load packages
library(ggplot2); library(magick); library(pwr); library(tidyverse)
# set seed
set.seed(170819)
We can test our research question using a simple Pearson correlation. Building on Baruh, Secinti, and Cemalcilar (2017), let us define that the actual correlation between privacy concerns and information sharing is r = -.13.
(For simplicity’s sake, I’m sticking with standardized effects throughout this blog. I know that unstandardized effects would be preferable, but standardized ones are a bit easier both from a didactic and a data-simulation perspective.)
Finally, we stick with our typical alpha level of 5%, and in Hohenheim there are currently 10,000 students (our population).
# define parameters
n_pop <- 10000
r_pop <- -.13
alpha_crit <- .05
# simulate values for privacy concerns
priv_con <- rnorm(n = n_pop, mean = 0, sd = 1)
# compute values for information sharing that are related to privacy concerns
inf_sha <- r_pop * priv_con + rnorm(n = n_pop, mean = 0, sd = 1)
# save as data.frame
d <- data.frame(priv_con, inf_sha)
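One caveat worth noting: adding standard-normal noise to r_pop * priv_con does not induce a correlation of exactly r_pop. If y = r * x + e with sd(x) = sd(e) = 1, the induced correlation is r / sqrt(1 + r^2), which is close enough to r for small effects. A quick base-R back-of-the-envelope check (a sketch, reusing r_pop from above):

```r
# implied population correlation when y = r * x + e, with sd(x) = sd(e) = 1:
# cor(x, y) = r / sqrt(1 + r^2), i.e. only approximately r
r_pop <- -.13
r_implied <- r_pop / sqrt(1 + r_pop^2)
round(r_implied, 3)  # -0.129, close enough to the intended -.13
```

If you wanted the population correlation to be exactly r_pop, you could instead draw the noise with sd = sqrt(1 - r_pop^2); for r = -.13 the difference is negligible.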
Let’s first check whether the simulation worked.
cor.test(d$priv_con, d$inf_sha, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: d$priv_con and d$inf_sha
## t = -10, df = 9998, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.15 -0.12
## sample estimates:
## cor
## -0.13
Yes indeed, in our population we have a correlation of r = -.13.
Now, let’s imagine that we’re running a study to find out whether the privacy paradox exists. Because we cannot ask all 10,000 students, we’re going to collect a sample.
How many? Good question. 200 seems like quite a lot; that should do the job, right?
Let’s simulate this study by randomly drawing a sample of 200 participants from our population. Will we find an effect?
# define sample size
n_sample <- 200
# randomly define participants who are going to be selected for the study
id_sample <- sample(nrow(d), n_sample)
# create dataframe of subsample
d_sample <- d[id_sample, ]
# calculate correlation
results_complete <- cor.test(d_sample$priv_con,
d_sample$inf_sha,
method = "pearson")
print(results_complete)
##
## Pearson's product-moment correlation
##
## data: d_sample$priv_con and d_sample$inf_sha
## t = -1, df = 198, p-value = 0.2
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.231 0.044
## sample estimates:
## cor
## -0.095
As a result, we find a correlation of r = -0.1 and a p-value of p = 0.18. Hence, our result is not significant. Bummer!
Normally, we would now conclude that yes, the privacy paradox indeed seems to be a thing, because on the basis of the data we cannot reject it. Which would be false, of course: the effect does exist in our population; our sample was simply too small to detect it.
We could also try again and see what happens. Let’s run the study again.
# randomly define participants who are going to be selected for the subsample
sample <- sample(nrow(d), n_sample)
# create dataframe of subsample
d_sample <- d[sample, ]
# calculate correlation
results_complete <- cor.test(d_sample$priv_con,
d_sample$inf_sha,
method = "pearson")
print(results_complete)
##
## Pearson's product-moment correlation
##
## data: d_sample$priv_con and d_sample$inf_sha
## t = -3, df = 198, p-value = 0.01
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.308 -0.039
## sample estimates:
## cor
## -0.18
This time, we find a correlation of r = -0.18 and a p-value of p = 0.01, which is significant. Hooray!
If we now repeat this a hundred times, this is what we would get:
# number of studies to be run
n_studies <- 100
# initialize object
results <- data.frame(study = 0, r = 0, p = 0, significant = TRUE)
# run simulation (make sure the output folder exists first)
dir.create("figures", showWarnings = FALSE)
for(i in 1:n_studies) {
study_no <- i
sample <- sample(nrow(d), n_sample)
d_sample <- d[sample, ]
results_complete <- cor.test(d_sample$priv_con,
d_sample$inf_sha,
method = "pearson")
results[study_no, ] <- data.frame(study_no,
results_complete$estimate,
results_complete$p.value,
results_complete$p.value < alpha_crit)
# plot results
p <- ggplot(select(results, -p), aes(x = study, y = r, color = significant)) +
geom_point() +
theme_bw() +
xlim(0, n_studies) +
ylim(-.3, .1)
ggsave(paste0("figures/figure_", sprintf("%03d", study_no), ".png"), plot = p)
}
# create gif
system2("magick",
c("convert", "-delay 30", "figures/*.png",
"figures/power_50_animated.gif"))
# remove individual pngs
file.remove(paste0("figures/", list.files(path = "figures/", pattern=".png")))
What do we see? Sometimes we find significant effects, sometimes we don’t. But given that the relation actually does exist in the population, we miss it far too often. How often have we been right?
mean(results$significant)
## [1] 0.5
In half of the cases. In other words, we only had a 50% probability, that is, a power of 50%, of finding the effect.
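The bean-counting can also be wrapped into a more compact simulation. The following sketch estimates power by repeatedly drawing fresh samples directly from the data-generating model (rather than subsampling our fixed population, so the numbers will differ slightly):

```r
# estimate power by simulation: share of significant results across many studies
set.seed(1)
n_sims   <- 1000
n_sample <- 200
r_pop    <- -.13

p_values <- replicate(n_sims, {
  x <- rnorm(n_sample)
  y <- r_pop * x + rnorm(n_sample, sd = sqrt(1 - r_pop^2))  # exact population r
  cor.test(x, y, method = "pearson")$p.value
})

mean(p_values < .05)  # estimated power, around .45
```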
Instead of counting simulated data beans, it is also possible to calculate the achieved power analytically. For this you can use the R package pwr.
power <- pwr.r.test(n = n_sample, r = -.13, sig.level = .05)
print(power)
##
## approximate correlation power calculation (arctangh transformation)
##
## n = 200
## r = 0.13
## sig.level = 0.05
## power = 0.45
## alternative = two.sided
As you can see, we get a very similar result: the power is 0.45.
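For the curious, this result can be reproduced almost exactly by hand. The test rests on the Fisher z-transformation of r, whose sampling distribution is approximately normal with standard error 1/sqrt(n − 3). A base-R sketch of that approximation (ignoring the negligible opposite tail of the two-sided test):

```r
# analytic power via the Fisher z-transformation (normal approximation)
n     <- 200
r     <- .13
alpha <- .05

z_effect <- atanh(r) * sqrt(n - 3)              # expected z-statistic
power    <- pnorm(z_effect - qnorm(1 - alpha / 2))
round(power, 2)  # 0.45, matching pwr.r.test
```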
Right, so let’s step back for a minute, because it’s really important to understand what this all means. If we study the privacy paradox with such a sample, we would more often than not fail to detect the effect and thus wrongly conclude that the privacy paradox exists, when in fact it does not! In other words, with 200 people we simply cannot meaningfully test the privacy paradox; we would be wrong more often than we would be right. In short, our study is not informative: it doesn’t add anything to our understanding of the theoretical problem.
Of course, this does not only pertain to the privacy paradox. It holds for all research questions where you would expect a small relation (here, r = -.13). Hence, if you read a paper analyzing a research question where you think “hmmm, this effect should most likely be small …”, and the study includes, say, 200 observations, you can stop reading at that point.
So what level of power would be ideal? Remember that the effect actually exists. So of course we want our study to have a very good chance of finding that effect; otherwise, we would be wasting important resources, all the while risking false theoretical conclusions. So, in most scenarios it’s safe to say: the more power, the better.
Often people quote Cohen (1992) and state that studies should have a power of 80%. However, maybe it’s just me, but I think that’s still too risky. If I only have an 80% probability of finding something that actually exists, I would rather invest more resources and recruit additional participants. Personally, I feel much more comfortable with a 95% probability.
But how many participants would we need in order to attain that probability? Again, we can estimate this using pwr.
power_req <- .95
power <- pwr.r.test(r = r_pop, sig.level = alpha_crit, power = power_req)
print(power)
##
## approximate correlation power calculation (arctangh transformation)
##
## n = 762
## r = 0.13
## sig.level = 0.05
## power = 0.95
## alternative = two.sided
As we can see, in order to have a 95% chance of getting a significant result, we would need to ask 763 people (762.32, rounded up).
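This number can also be approximated by hand. Solving the Fisher-z normal approximation of the correlation test for n makes it transparent where it comes from (a sketch; pwr’s 762.32 is marginally smaller because it uses a slightly refined version of this approximation):

```r
# required n for power .95, alpha .05, r = .13, via the Fisher z-approximation:
# n = ((z_{1 - alpha/2} + z_{power}) / atanh(r))^2 + 3
r         <- .13
alpha     <- .05
power_req <- .95

n_req <- ((qnorm(1 - alpha / 2) + qnorm(power_req)) / atanh(r))^2 + 3
round(n_req)  # 763, in line with pwr's 762.32
```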
So let’s go back to our simulated data to see whether that really works!
# define sample size
n_sample <- ceiling(power$n)  # round up to a whole number of participants
# initialize object
results <- data.frame(study = 0, r = 0, p = 0, significant = TRUE)
# run simulation (make sure the output folder exists first)
dir.create("figures", showWarnings = FALSE)
for(i in 1:n_studies) {
study_no <- i
sample <- sample(nrow(d), n_sample)
d_sample <- d[sample, ]
results_complete <- cor.test(d_sample$priv_con,
d_sample$inf_sha,
method = "pearson")
results[study_no, ] <- data.frame(study_no,
results_complete$estimate,
results_complete$p.value,
results_complete$p.value < alpha_crit)
# plot results
p <- ggplot(select(results, -p), aes(x = study, y = r, color = significant)) +
geom_point() +
theme_bw() +
xlim(0, n_studies) +
ylim(-.3, .1)
ggsave(paste0("figures/figure_", sprintf("%03d", study_no), ".png"), plot = p)
}
# create gif
system2("magick",
c("convert", "-delay 30", "figures/*.png",
"figures/power_95_animated.gif"))
# remove individual pngs
file.remove(paste0("figures/", list.files(path = "figures/", pattern=".png")))
Indeed, it does. This time we find a significant result in nearly every case. In other words, we had the roughly 95% probability (power) of finding the effect that we were aiming for.
When trying to convince colleagues of the necessity of running power analyses (at least in communication science, that unfortunately still means most of them), I have often heard the following response:
“I’d like to, but for this research question there isn’t yet a meta-analysis that suggests the actual effect size – so it’s not possible to run meaningful a priori power analyses.”
That’s not true. In fact, instead of basing your power analysis on a meta-analysis (which is likely to be biased anyway), it is much more expedient to determine a so-called smallest effect size of interest (SESOI) (e.g., Lakens, Scheel, and Isager 2018). In other words, you determine an effect size that you consider just large enough to count as support for your theoretical assumption. In turn, everything significantly smaller than the SESOI would be trivial.
But how do you determine a SESOI? As a first (!) step, you could for example require that your effect should at least be small according to the conventions of Cohen (1992). Preferably, however, you should set real-life criteria using unstandardized effects (but that’s an issue for another post). This SESOI is then what you use for your power calculations.
In addition, setting a SESOI is crucial because p-values alone don’t suffice as support for your theory. If your effect is trivial, small p-values cannot compensate. So it’s always both: finding significance in order to determine the data’s surprisingness, and evaluating effect sizes to gauge the effect’s relevance.
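For illustration, suppose we set our SESOI to r = .10, i.e. we declare anything below a small effect in Cohen’s (1992) terms theoretically trivial (an assumed threshold, purely for the sake of the example). The required sample size then follows as before, sketched here with the Fisher-z normal approximation in base R:

```r
# required n to detect a SESOI of r = .10 with 95% power at alpha = .05
sesoi     <- .10   # assumed smallest effect size of interest
alpha     <- .05
power_req <- .95

n_req <- ((qnorm(1 - alpha / 2) + qnorm(power_req)) / atanh(sesoi))^2 + 3
ceiling(n_req)  # roughly 1,300 participants
```

Note how quickly the required sample size grows as the SESOI shrinks.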
Now what does this mean for our research, and especially for bachelor and master theses … ?
Above all, we want to run well-powered studies. For research questions where we have to expect small effect sizes, this means collecting a large number of observations. In other words, some research questions simply require a ton of resources.
But what do we do if we don’t have many resources? Fortunately, there are several valid options:
By now, there is a myriad of publicly available large-scale open datasets. Several of these include items designed by social scientists and allow you to conduct high-quality analyses of topical questions. In the following blog post, I have compiled a list. I’m highly sceptical of the often-heard position that researchers always need to collect their own data – informativeness trumps specificity.
It is not the number of participants that matters, but the number of observations. Sometimes it is possible to run within-person designs, which allow for more observations per participant and are thereby more informative and more efficient (e.g., Gelman 2017). Experiments, in particular, can often easily be changed into a within-person design.
Researchers often team up in order to be able to collect sufficient observations. Most prominently, in psychology there is the so-called Psychological Science Accelerator, which pools the resources of several labs in order to design large-scale studies. In addition, an increasing number of researchers run so-called multi-site studies (without artificially focusing on cultural aspects). Also in BA, MA, or PhD theses, it is highly advisable to join forces and collect data together. Just because your advisor collected his or her own data does not mean that you have to as well.
It might sound depressing, but sometimes there’s no way around adapting or altogether abandoning your research question. For example, if you’re interested in priming effects induced by the subtlest of changes to your stimuli, you either need a ton of resources or a different research question. There’s no inherent right to analyze a specific research question – some are simply not feasible. But there are remedies: for example, it’s often possible to use stimuli that are more salient, to adopt a different research paradigm, or to swap one or two general variables for more specific ones – all measures that can increase your power.
Let us conclude: power is extremely important. The empirical results of low-powered studies – however well-designed and theoretically well-crafted – don’t add anything to the literature. To determine an adequate sample size, it’s crucial to run a priori power analyses, preferably based on a smallest effect size of interest. There are several options we can choose from in order to achieve adequately powered studies. During this process, some customs and cultures might need to change, yes, but as always it’s best to simply be the change you want to see. No reason to fret: we can do it.
Barnes, Susan B. 2006. “A privacy paradox: Social networking in the United States.” First Monday 11 (9). www.firstmonday.org/issues/issue11_9/barnes/index.html.
Baruh, Lemi, Ekin Secinti, and Zeynep Cemalcilar. 2017. “Online privacy concerns and privacy management: A meta-analytical review.” Journal of Communication 67 (1): 26–53. https://doi.org/10.1111/jcom.12276.
Cohen, Jacob. 1992. “A power primer.” Psychological Bulletin 112 (1): 155–59. https://doi.org/10.1037/0033-2909.112.1.155.
Gelman, Andrew. 2017. “Poisoning the well with a within-person design? What’s the risk?” https://statmodeling.stat.columbia.edu/2017/11/25/poisoning-well-within-person-design-whats-risk/.
Lakens, Daniël, Anne M. Scheel, and Peder M. Isager. 2018. “Equivalence testing for psychological research: A tutorial.” Advances in Methods and Practices in Psychological Science 1 (2): 259–69. https://doi.org/10.1177/2515245918770963.
Rouder, Jeffrey N., Richard D. Morey, Josine Verhagen, Jordan M. Province, and Eric-Jan Wagenmakers. 2016. “Is there a free lunch in inference?” Topics in Cognitive Science 8 (3): 520–47. https://doi.org/10.1111/tops.12214.